Data Deduplication in Parallel Mining of Frequent Item sets using MapReduce

Not recorded

Abstract

This paper presents a parallel frequent item set mining (FIM) algorithm called FiDoop, built on the MapReduce programming model. FiDoop incorporates the frequent items ultrametric tree (FIU-tree) and uses three MapReduce jobs to complete the mining task. The scalability problem has previously been addressed by a handful of FP-growth-like parallel FIM algorithms. In FiDoop, the mappers independently and concurrently decompose item sets; the reducers perform combination operations by constructing small ultrametric trees and mining these trees in parallel. Data deduplication is an important data compression technique that eliminates duplicate copies of repeating data, reducing the amount of storage space required and saving bandwidth. The technique improves storage utilization and can also be applied to reduce the number of bytes. The first MapReduce job discovers all frequent items; the second scans the database to generate k-item sets by removing infrequent items; and the third, the most complicated one, constructs the k-FIU-tree and mines all frequent k-item sets. In this paper, we apply a deduplication technique in the third MapReduce job to avoid replication of data in the frequent item sets and to improve performance. The approach produces highly related mining results in less time and increases the available storage capacity. Hadoop supports nine different tools, while Mahout is based on core algorithms and classifications. A sequence algorithm is used to produce better output. We aim to implement a recommendation algorithm using Mahout, a machine learning tool, on the Hadoop platform to provide a scalable system for processing large data sets efficiently. Running on such platforms yields quicker performance.
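The three-job flow described in the abstract can be sketched in plain Python. This is a minimal single-process illustration, not the authors' implementation: all function names are hypothetical, the FIU-tree is replaced by a simple counter, and the deduplication step is one possible reading of the abstract (identical pruned transactions are collapsed into one entry with a multiplicity before the third stage decomposes them).

```python
from collections import Counter
from itertools import combinations


def job1_frequent_items(transactions, min_support):
    """Job 1: count each item once per transaction and keep the frequent ones."""
    counts = Counter(item for t in transactions for item in set(t))
    return {item for item, n in counts.items() if n >= min_support}


def job2_prune(transactions, frequent_items):
    """Job 2: drop infrequent items; emit each pruned transaction as a sorted tuple."""
    pruned = []
    for t in transactions:
        kept = tuple(sorted(set(t) & frequent_items))
        if kept:
            pruned.append(kept)
    return pruned


def deduplicate(pruned):
    """Assumed deduplication step: collapse identical pruned transactions
    into {item set: multiplicity} so each distinct item set is stored once."""
    return Counter(pruned)


def job3_mine(unique_itemsets, min_support):
    """Job 3 (simplified, no FIU-tree): decompose each distinct item set once
    and weight every subset by the multiplicity of its parent item set.
    Subset enumeration is exponential in the item set length."""
    support = Counter()
    for itemset, multiplicity in unique_itemsets.items():
        for size in range(1, len(itemset) + 1):
            for subset in combinations(itemset, size):
                support[subset] += multiplicity
    return {s: n for s, n in support.items() if n >= min_support}


def mine(transactions, min_support):
    """Run the three stages in sequence on an in-memory list of transactions."""
    frequent_items = job1_frequent_items(transactions, min_support)
    pruned = job2_prune(transactions, frequent_items)
    return job3_mine(deduplicate(pruned), min_support)
```

In this sketch the saving comes from the third stage: a transaction that occurs many times is decomposed once and its subsets are credited with the full multiplicity, so the supports are the same as without deduplication while less data is carried between stages.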

Similar resources

An Improved Technique Of Extracting Frequent Itemsets From Massive Data Using MapReduce

The mining of frequent itemsets is a basic and essential task in many data mining applications. Extracting frequent itemsets together with frequent patterns and rules supports applications such as association rule mining and correlation analysis, including in product sales and marketing. A number of algorithms are used to extract frequent itemsets, such as FP-growth and Eclat. But unfortunately these algorit...

Parallel Rule Mining with Dynamic Data Distribution under Heterogeneous Cluster Environment

Big data mining methods support knowledge discovery on highly scalable, high-volume and high-velocity data elements. The cloud computing environment provides computational and storage resources for the big data mining process. Hadoop is a widely used parallel and distributed computing platform for big data analysis and manages both homogeneous and heterogeneous computing models. The MapReduce fra...

Weighted Itemset Mining from Bigdata using Hadoop

Data items have been extracted using an empirical data mining technique called frequent itemset mining. In the majority of application contexts, items are enriched with weights. Pushing item weights into the itemset extraction process, i.e., mining weighted itemsets rather than traditional itemsets, is an appealing research direction. Although many efficient weighted itemset mining algorithms a...

Performance Evaluation of Apriori Algorithm on a Hadoop Cluster

Frequent itemset mining is a well-known concept in data science. When frequent itemset mining algorithms are fed large datasets, they quickly become resource hungry as their search space explodes. This problem is even more apparent when we try to use them on Big Data. Recent advances in parallel programming provide good solutions for dealing with large datasets, but they present their own problems w...

Mining Frequent Item Sets Using Map Reduce Paradigm

In text categorization techniques such as text classification or clustering, finding frequent item sets is a familiar method in current research. Although finding frequent item sets with the Apriori algorithm is a widespread method, the DHP, partitioning, sampling, DIC, Eclat, FP-growth and H-mine algorithms were later shown to perform better than Apriori on standalone systems. In real sce...

Publication date: 2016
